
Scale NN distances and sort by distance and index #105

Merged

grovduck merged 9 commits into main from discretize-distances on Jan 31, 2026

Conversation

@grovduck
Member

@grovduck grovduck commented Jan 22, 2026

Due to floating-point differences between operating systems when using np.dot, very close neighbor distances can swap positions, leading to differences in neighbor indexes and, thus, predictions. This PR first scales and discretizes distances so that very similar or exact distances share the same value; tied distances are then sorted by neighbor index.

This change affected the neighbor sorting of four RFNN tests: in the case of very close or tied distances, the neighbor with the smaller index is now chosen as the nearer neighbor. The underlying discrepancy stems from using np.dot to calculate Hamming distances. No other estimators' regression tests were affected by this change.

@aazuspan, let me know what you think of this fix. I went back and forth on surfacing the EPS value as a parameter. For now, it is not configurable, but I don't have a strong feeling as to whether a user would want this flexibility. Also, if you have a sense whether this is an inefficient calculation, please let me know. I didn't notice that tests slowed down much, but there may be some optimization that I'm not aware of.

Lastly, even though I think this should only impact estimators that use Hamming distance, I decided to implement it in RawKNNRegressor.kneighbors, thinking that this defined sorting behavior would be nice for all estimators. The one place I can see this tripping users up is in cross-validation, if one or more samples have identical feature values. The sample with the lower index will still be its own nearest neighbor, whereas the sample with the higher index would now be, at best, the second nearest neighbor. In our experience, co-located plots can sometimes exhibit this behavior.
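For illustration, the scale-and-sort step described above could be sketched like this in plain NumPy (the `EPS` value matches the 1e-10 mentioned in the review below, but the function name and details are illustrative, not sknnr's actual implementation):

```python
import numpy as np

# Illustrative epsilon used for discretization (assumption: 1e-10).
EPS = 1e-10

def deterministic_order(distances, indices):
    """Sort each row of neighbors by discretized distance, then by index.

    distances, indices: (n_queries, n_neighbors) arrays, as returned by
    a kneighbors-style call.
    """
    # Discretize so that near-identical distances compare as equal.
    discretized = np.round(distances / EPS).astype(np.int64)
    # lexsort uses the last key as the primary sort key: distance is
    # primary, neighbor index deterministically breaks ties.
    order = np.lexsort((indices, discretized), axis=1)
    rows = np.arange(distances.shape[0])[:, None]
    return distances[rows, order], indices[rows, order]

# Two neighbors whose distances differ only by floating-point noise
# are reordered by index:
d = np.array([[0.30000000000000004, 0.3]])
i = np.array([[7, 2]])
print(deterministic_order(d, i)[1])  # [[2 7]]
```

Without the discretization step, the 0.30000000000000004 vs. 0.3 comparison would keep neighbor 7 first on some platforms and neighbor 2 first on others.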

PS: Even though we were surprised not to see this bug when working on #101, it did come back to bite us in current work on #99. I'd like to get this fix in place before going back to that PR.

Due to floating-point differences between operating systems when using
`np.dot`, very close neighbor distances can swap position leading to
differences in neighbor indexes and, thus, predictions.  This commit
first scales and discretizes distances such that very similar or exact
distances will share the same value in which case they are subsequently
sorted by neighbor index.
@grovduck grovduck self-assigned this Jan 22, 2026
@grovduck grovduck added the bug Something isn't working label Jan 22, 2026
@grovduck grovduck requested review from aazuspan and Copilot January 22, 2026 17:22
Copilot AI left a comment

Pull request overview

This PR addresses floating-point precision issues in nearest neighbor distance calculations by implementing a deterministic sorting approach. When distances are very similar due to platform-specific floating-point differences in np.dot, the new implementation scales and discretizes distances, then sorts first by scaled distance and then by neighbor index to ensure consistent ordering across platforms.

Changes:

  • Modified RawKNNRegressor.kneighbors to scale distances by epsilon (1e-10) and use lexicographic sorting by distance then index
  • Updated four RFNN regression test files to reflect the new deterministic neighbor ordering

Reviewed changes

Copilot reviewed 1 out of 5 changed files in this pull request and generated 3 comments.

Changed files:

  • src/sknnr/_base.py: Added distance scaling and lexicographic sorting to ensure deterministic neighbor ordering when distances are very close
  • tests/test_regressions/test_mixed_rfnn_forests_target_.npz: Updated expected test outputs for RFNN with new sorting behavior
  • tests/test_regressions/test_mixed_rfnn_forests_reference_.npz: Updated expected test outputs for RFNN reference data with new sorting behavior
  • tests/test_regressions/test_kneighbors_reference_full_randomForest_k5_index_.npz: Updated expected test outputs for kneighbors with new sorting behavior
  • tests/test_regressions/test_kneighbors_reference_full_randomForest_k5_ids_.npz: Updated expected test outputs for kneighbors with dataframe ids and new sorting behavior


@aazuspan
Contributor

@grovduck, I am slightly hesitant about the fix, but I'm more than willing to defer to your experience here if you think this is the right solution. I think most of my reservations are because this seems like we're changing behavior of the library code for the sake of the test code, but maybe that's off-base and it would be valuable more generally to improve reproducibility across platforms?

In any case, I'm also not sure what harm this could cause, aside from a negligible loss in precision in neighbor comparisons (presumably the difference between two neighbors past 10 decimal places is effectively meaningless).

I went back and forth on surfacing the EPS value as a parameter. For now, it is not configurable, but I don't have a strong feeling as to whether a user would want this flexibility.

If we had some kind of configuration API (something like pandas.options) I could see that being a good place for this, but as a parameter, I could see that the benefit of flexibility might be outweighed by the difficulty explaining it.

You could consider adding a parameter to enable/disable it, but I'm not sure whether that's worth it either.

Also, if you have a sense whether this is an inefficient calculation, please let me know. I didn't notice that tests slowed down much, but there may be some optimization that I'm not aware of.

It looks nicely vectorized to me. I was curious about the performance impact and ran some quick tests with 4000 samples, 10 features, and 10 neighbors, and the added code for scaling, rounding, sorting, and re-indexing only increased the kneighbors time from 194 ms to 195.7 ms, so it seems almost negligible.

Lastly, even though I think this should only impact estimators that use Hamming distance, I decided to implement it in RawKNNRegressor.kneighbors, thinking that this defined sorting behavior would be nice for all estimators.

Yeah, I agree that being consistent across estimators makes sense.

The one place that I can see where this would trip users up is in cross-validation if one or more samples have identical feature values. The sample with the lower index will still be its own nearest neighbor, whereas the sample with the higher index would now be at best the second nearest neighbor. In our experience, co-located plots can sometimes exhibit this behavior.

I don't think I'm totally following... Does this mean that you might somehow compare a feature against itself during cross-validation, whereas you'd otherwise get an independent feature? Or is it the difference between which of two identical features you'd identify as the first neighbor?

@grovduck
Member Author

@grovduck, I am slightly hesitant about the fix, but I'm more than willing to defer to your experience here if you think this is the right solution. I think most of my reservations are because this seems like we're changing behavior of the library code for the sake of the test code, but maybe that's off-base and it would be valuable more generally to improve reproducibility across platforms?

@aazuspan, I can completely understand your hesitancy here and I appreciate you making me elaborate on why I think it's needed. Definitely the impetus for introducing the fix was failing tests and I concede that point. But I think the bigger issue is that we need consistency across platforms for neighbor order, which impacts predictions and scores. My thought is that the neighbor assignment needs to be deterministic across all platforms/users and the tests exposed that this wasn't the case.

We're obviously seeing this behavior of tied distances much more in the estimators that use Hamming distance because it's simply a proportion of matched nodes between target and reference (as we've discussed before). If we use equally weighted trees (the default in RFNN and an option for GBNN), the probability of tied distances is even higher.
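To see why ties are so common here, consider a rough sketch (synthetic data for illustration only, not sknnr's code): with equally weighted trees, the Hamming-style distance is one minus the fraction of trees whose terminal node assignments match, so it can only take n_trees + 1 distinct values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trees = 100

# Synthetic terminal node assignments per tree for one target sample
# and five reference samples (4 possible nodes per tree).
target = rng.integers(0, 4, size=n_trees)
refs = rng.integers(0, 4, size=(5, n_trees))

# Hamming-style distance: proportion of trees whose nodes differ.
matches = (refs == target).sum(axis=1)
hamming = 1.0 - matches / n_trees

# Every distance is an exact multiple of 1/n_trees, so with many
# reference samples, exact ties between different neighbors are likely.
print(hamming)
```

Because the distances live on a coarse grid of multiples of 1/n_trees, any platform-dependent rounding in the np.dot computation can flip the order of genuinely tied neighbors.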

KNeighborsRegressor itself warns that "Regarding the Nearest Neighbors algorithms, if it is found that two neighbors, neighbor k+1 and k, have identical distances but different labels, the results will depend on the ordering of the training data."

Looking at this example with two samples that have identical feature values:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [1, 2, 3]
])
y = np.array([10, 20, 30])
est = KNeighborsRegressor(n_neighbors=2).fit(X, y)

# Independent neighbors
print(est.kneighbors()[1])
# [[2 1]
#  [0 2]
#  [0 1]]

# Dependent neighbors
print(est.kneighbors(X)[1])
# [[0 2]
#  [1 0]
#  [0 2]] - Not returning itself as first neighbor

I feel what I'm proposing here is relaxing the criterion for what counts as a tied distance, to correct for numerical imprecision.

I think my main argument is that, given the exact same inputs and parameters, estimators should be deterministic. Perhaps what you are thinking is that the platform, package versions, etc. are part of those inputs/parameters, and that estimators are deterministic in that sense. What I'm proposing in this fix is a "broadening" of that determinism to account for some of these differences (and, of course, to make tests pass!). But please push back if you don't think this is the right way to go about it. I just haven't come up with other options.

In any case, I'm also not sure what harm this could cause, aside from a negligible loss in precision in neighbor comparisons (presumably the difference between two neighbors past 10 decimal places is effectively meaningless).

Yes, I think this is what I'd argue as well. Distances past x decimal places are meaningless, so we may as well set neighbors in a way that will be reproducible.

If we had some kind of configuration API (something like pandas.options) I could see that being a good place for this, but as a parameter, I could see that the benefit of flexibility might be outweighed by the difficulty explaining it.

You could consider adding a parameter to enable/disable it, but I'm not sure whether that's worth it either.

Your second thought here is intriguing. That would allow tests to pass and allay my concerns about reproducibility (and I think I'd make the proposed fix the default), but would still allow a user to use the true calculated distances. We'd have to surface the parameter on TransformedKNeighborsRegressor.kneighbors as well, but not terribly disruptive. Let me know if you think that's worth it.

It looks nicely vectorized to me. I was curious about the performance impact and ran some quick tests with 4000 samples, 10 features, and 10 neighbors, and the added code for scaling, rounding, sorting, and re-indexing only increased the kneighbors time from 194 ms to 195.7 ms, so it seems almost negligible.

Thanks for going to the trouble of doing that testing. Good to know that the effect is negligible.

I don't think I'm totally following... Does this mean that you might somehow compare a feature against itself during cross-validation, whereas you'd otherwise get an independent feature? Or is it the difference between which of two identical features you'd identify as the first neighbor?

Sorry, that explanation wasn't totally clear and using cross-validation as a term in my explanation didn't help. But it's the latter clause that could cause issues. I'll try to give a clear example that isn't necessarily related to cross-validation.

In our GNN modeling, we capture dependent neighbors/predictions as well as independent neighbors/predictions. Dependent predictions are when a plot can use itself as a neighbor. As one of our tests to identify plots for screening, we want to know how "deep" in the neighbor stack a plot had to go to self-impute1. In sknnr as currently defined, this is always the first neighbor given that the distance between a plot and itself will always be zero. With the proposed fix, that self-imputation could be as deep as the number of identical features -or- even the number of features that evaluated to the same distance (think tied number of matches in RFNN). So a feature that had an index of 3000 would self-impute at neighbor 3 if indices 1000 and 2000 had the same rounded distance. This wouldn't come up in LOGO cross-validation, but it would still impact this outlier screen.

In our experience, it's very rare to have samples that share all feature values so this wouldn't come up often. There are a couple of other corner cases that come up, but I'm recognizing that my explanation is pretty rambling at this point and it may be easier to talk it through.

Footnotes

  1. This is complicated by the fact that we do this evaluation at pixel scale, with nine pixels per plot, so pixel X values do not necessarily equal plot X values. We then take a mean of these self-imputation ranks across the nine pixels. High values can show areas where a plot footprint is quite heterogeneous and perhaps not suitable for modeling.

@grovduck grovduck changed the title Scale NN distances by epsilon and sort by distance and index Scale NN distances and sort by distance and index Jan 23, 2026
@aazuspan
Contributor

Thanks for the details on the motivation and how you see this fitting into sklearn. I'm convinced that it goes beyond a patch for our testing framework and addresses a real limitation of sklearn, so I'm onboard for the change :)

That would allow tests to pass and allay my concerns about reproducibility (and I think I'd make the proposed fix the default), but would still allow a user to use the true calculated distances. We'd have to surface the parameter on TransformedKNeighborsRegressor.kneighbors as well, but not terribly disruptive. Let me know if you think that's worth it.

That was my thought. I imagine the parameter won't have any effect in most cases, but as a user I think I would feel more comfortable knowing it was there just in case there was something special about my data that caused issues (the only thing I can think of would be a dataset where most of the distances are very small, with a few very large distances that lead to a huge scaling factor and a loss of meaningful precision at the low end, but I'm not sure that would ever arise in reality). Having to pass the parameter through TransformedKNeighborsRegressor is inconvenient, though.

With the proposed fix, that self-imputation could be as deep as the number of identical features -or- even the number of features that evaluated to the same distance (think tied number of matches in RFNN). So a feature that had an index of 3000 would self-impute at neighbor 3 if indices 1000 and 2000 had the same rounded distance.

Thanks, I think I'm mostly understanding now. Just as a thought, what if we broke distance ties based on the absolute "distance" between their indices, so that if plot ID 3000 had identical features to 1000 and 4000, the nearest neighbors would be [3000, 4000, 1000]? I had to tweak the example there since there's a chance to tie with the distance between indices as well, so I suppose we'd need a third tie-breaker with the index itself. That may be too targeted of a fix for such a specific issue and moves further from sklearn behavior, so I don't know if it's a good idea or not.

When ordering neighbors, distances are used as the first screen.  If
distances are considered to be equal and `use_deterministic_ordering` is
`True`, use the absolute neighbor index difference as the second-level
ordering.  This ensures that when `kneighbors` is called with the `X`
used to fit the estimator (i.e. non-independent cross-validation), the
first neighbor returned for each query point will be itself.
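A rough sketch of that three-level ordering (rounded distance, then absolute index difference, then index). The function name and the `decimals` default are illustrative, not sknnr's actual API; it assumes each query row corresponds to a reference index, as in non-independent imputation:

```python
import numpy as np

def order_with_index_difference(distances, indices, query_indices, decimals=10):
    """Order neighbors by (rounded distance, |index - query index|, index)."""
    discretized = np.round(distances, decimals=decimals)
    index_diff = np.abs(indices - query_indices[:, None])
    # lexsort: the last key is primary, so sort by rounded distance,
    # then absolute index difference, then raw index.
    order = np.lexsort((indices, index_diff, discretized), axis=1)
    rows = np.arange(distances.shape[0])[:, None]
    return distances[rows, order], indices[rows, order]

# Query point 3000 ties with identical references 1000 and 4000; the
# index-difference tie-breaker puts the query point itself first:
d = np.array([[0.0, 0.0, 0.0]])
i = np.array([[1000, 3000, 4000]])
print(order_with_index_difference(d, i, np.array([3000]))[1])
# [[3000 4000 1000]]
```

This reproduces the [3000, 4000, 1000] ordering from the example above, with the raw index as the final tie-breaker when two neighbors are equidistant in both distance and index difference.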
@grovduck
Member Author

@aazuspan, I've taken another hack at this to address the three remaining issues:

  1. Make DISTANCE_PRECISION_DECIMALS a class attribute. I settled on this name to define the number of decimals to be used when scaling and rounding distances. Not sure I love the name, so happy if you have other suggestions. This is defined as a class attribute on RawKNNRegressor and I added documentation under the Attributes section. I've also added a test (test_estimators.py::test_kneighbors_precision_decimals) that monkey-patches this value. I've never written a test with monkey-patching before, so please let me know if this isn't the correct way to go about it.
  2. Expose use_deterministic_ordering as a parameter. This is added to both RawKNNRegressor and TransformedKNeighborsRegressor as a pass-through. Again, open to different naming on the parameter. If this value is True, it runs through the deterministic ordering and, if False, it uses the default sorting of sklearn's KNeighborsRegressor.kneighbors. Again, I have a new test (test_estimators.py::test_kneighbors_deterministic_ordering) that tests the effect of setting this flag. Note that this test and the first test are very similar, so we could perhaps collapse them into one. I went back and forth on this and it was easier for me to understand the tests if they were separated.
  3. Add absolute neighbor index difference as second-level sort. I liked your idea of adding this as a secondary sort and now have this implemented. I understand that this is serving a very narrow use case, but I think it's a nice property to have when doing non-independent imputation - all reference query points should have themselves as their first neighbor. Another test for this one (test_estimators.py::test_kneighbors_uses_index_difference) that shows the effect of passing two duplicate query points which differ in their neighbors based on their row index. Note that this modification once again changed some of the regression results, so I've pushed new versions of those here.
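As a toy illustration of what a decimals attribute like this controls (plain NumPy, not sknnr's code): the chosen number of decimals determines whether two nearly equal distances are treated as tied and thus fall through to the index-based tie-breakers.

```python
import numpy as np

# Two distances that differ only in the 11th decimal place.
a, b = 0.12345678901, 0.12345678902

# Rounded to 10 decimals they compare equal (a tie, to be broken by
# index); rounded to 12 decimals they remain distinct.
assert np.round(a, decimals=10) == np.round(b, decimals=10)
assert np.round(a, decimals=12) != np.round(b, decimals=12)
print("10 decimals: tied; 12 decimals: distinct")
```

Raising the attribute tightens the tie criterion toward sklearn's default behavior; lowering it treats more near-equal distances as ties.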

Thanks for your continued patience.

Contributor

@aazuspan aazuspan left a comment

@grovduck, this looks great! I agree with all your decisions and naming choices, and the new tests are very thorough.

The only change I'd propose is that it might be worth adding a new subheading to the Usage guide to explain how the deterministic ordering works and where it might differ from the default sklearn behavior, since that seems like a notable new feature. The parameter docstring could then repeat some of that information and/or link to the usage guide for more info.

@grovduck
Member Author

The only change I'd propose is that it might be worth adding a new subheading to the Usage guide to explain how the deterministic ordering works and where it might differ from the default sklearn behavior, since that seems like a notable new feature. The parameter docstring could then repeat some of that information and/or link to the usage guide for more info.

This is a great idea. I've now added a section to the user's guide and linked from the docstrings.

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 7 changed files in this pull request and generated 6 comments.



Contributor

@aazuspan aazuspan left a comment

@grovduck, the expanded docs look great! I had a few comments for you to consider, but I'm happy whenever you're ready to merge.

@grovduck
Member Author

@aazuspan, thanks for the final review. I think I've addressed your helpful comments and will merge when you've had a chance to quickly review.

Contributor

@aazuspan aazuspan left a comment

LGTM! This was a sticky issue, so it's great to have it solved :)

@grovduck grovduck merged commit 4874ba7 into main Jan 31, 2026
11 checks passed
@grovduck grovduck deleted the discretize-distances branch January 31, 2026 00:58